What is Our Data?

We will be using the Spotify Unpopular Songs dataset. It contains audio characteristics of many unpopular songs, such as perceived intensity (energy), key, loudness in decibels, popularity, and more.

Here, we're going to see whether we can find a way to sort songs into general classes (horrible, bad, meh, and passable) based on their popularity scores.

Exploring Our Data

Initial Processing

In this notebook, we will perform dimensionality reduction to try to improve the performance and accuracy of kNN classification.

Let’s read in the data and take a peek.

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
df <- read.csv("data/unpopular_songs.csv")
summary(df)
##   danceability        energy               key            loudness      
##  Min.   :0.0000   Min.   :0.0000203   Min.   : 0.000   Min.   :-51.808  
##  1st Qu.:0.4420   1st Qu.:0.3790000   1st Qu.: 2.000   1st Qu.:-13.796  
##  Median :0.6020   Median :0.5690000   Median : 5.000   Median : -9.450  
##  Mean   :0.5725   Mean   :0.5497713   Mean   : 5.223   Mean   :-11.359  
##  3rd Qu.:0.7300   3rd Qu.:0.7450000   3rd Qu.: 9.000   3rd Qu.: -6.726  
##  Max.   :0.9860   Max.   :1.0000000   Max.   :11.000   Max.   :  3.108  
##       mode        speechiness      acousticness    instrumentalness  
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000000  
##  1st Qu.:0.000   1st Qu.:0.0384   1st Qu.:0.0365   1st Qu.:0.000000  
##  Median :1.000   Median :0.0589   Median :0.2330   Median :0.000133  
##  Mean   :0.641   Mean   :0.1380   Mean   :0.3542   Mean   :0.232943  
##  3rd Qu.:1.000   3rd Qu.:0.1880   3rd Qu.:0.6570   3rd Qu.:0.517000  
##  Max.   :1.000   Max.   :0.9620   Max.   :0.9960   Max.   :1.000000  
##     liveness         valence           tempo        duration_ms     
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.0   Min.   :   4693  
##  1st Qu.:0.0993   1st Qu.:0.2380   1st Qu.: 93.0   1st Qu.: 151152  
##  Median :0.1290   Median :0.4680   Median :117.1   Median : 197522  
##  Mean   :0.2121   Mean   :0.4646   Mean   :117.8   Mean   : 205578  
##  3rd Qu.:0.2680   3rd Qu.:0.6850   3rd Qu.:138.9   3rd Qu.: 244428  
##  Max.   :0.9990   Max.   :0.9950   Max.   :239.5   Max.   :3637277  
##    explicit           popularity      track_name        track_artist      
##  Length:10877       Min.   : 0.000   Length:10877       Length:10877      
##  Class :character   1st Qu.: 1.000   Class :character   Class :character  
##  Mode  :character   Median : 2.000   Mode  :character   Mode  :character  
##                     Mean   : 3.079                                        
##                     3rd Qu.: 3.000                                        
##                     Max.   :18.000                                        
##    track_id        
##  Length:10877      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

We can see that we largely have quantitative data, with a few exceptions. Not all of these columns are useful, but for now we'll make whether or not a song is explicit a factor, and do the same for popularity (after we look at correlation). We'll also look for correlated values.

df$explicit <- as.factor(df$explicit)
summary(df)
##   danceability        energy               key            loudness      
##  Min.   :0.0000   Min.   :0.0000203   Min.   : 0.000   Min.   :-51.808  
##  1st Qu.:0.4420   1st Qu.:0.3790000   1st Qu.: 2.000   1st Qu.:-13.796  
##  Median :0.6020   Median :0.5690000   Median : 5.000   Median : -9.450  
##  Mean   :0.5725   Mean   :0.5497713   Mean   : 5.223   Mean   :-11.359  
##  3rd Qu.:0.7300   3rd Qu.:0.7450000   3rd Qu.: 9.000   3rd Qu.: -6.726  
##  Max.   :0.9860   Max.   :1.0000000   Max.   :11.000   Max.   :  3.108  
##       mode        speechiness      acousticness    instrumentalness  
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000000  
##  1st Qu.:0.000   1st Qu.:0.0384   1st Qu.:0.0365   1st Qu.:0.000000  
##  Median :1.000   Median :0.0589   Median :0.2330   Median :0.000133  
##  Mean   :0.641   Mean   :0.1380   Mean   :0.3542   Mean   :0.232943  
##  3rd Qu.:1.000   3rd Qu.:0.1880   3rd Qu.:0.6570   3rd Qu.:0.517000  
##  Max.   :1.000   Max.   :0.9620   Max.   :0.9960   Max.   :1.000000  
##     liveness         valence           tempo        duration_ms     
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.0   Min.   :   4693  
##  1st Qu.:0.0993   1st Qu.:0.2380   1st Qu.: 93.0   1st Qu.: 151152  
##  Median :0.1290   Median :0.4680   Median :117.1   Median : 197522  
##  Mean   :0.2121   Mean   :0.4646   Mean   :117.8   Mean   : 205578  
##  3rd Qu.:0.2680   3rd Qu.:0.6850   3rd Qu.:138.9   3rd Qu.: 244428  
##  Max.   :0.9990   Max.   :0.9950   Max.   :239.5   Max.   :3637277  
##   explicit      popularity      track_name        track_artist      
##  False:7945   Min.   : 0.000   Length:10877       Length:10877      
##  True :2932   1st Qu.: 1.000   Class :character   Class :character  
##               Median : 2.000   Mode  :character   Mode  :character  
##               Mean   : 3.079                                        
##               3rd Qu.: 3.000                                        
##               Max.   :18.000                                        
##    track_id        
##  Length:10877      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
cor(df[c(1:12, 14)])
##                   danceability      energy          key     loudness
## danceability      1.0000000000  0.10357554  0.001416440  0.384798006
## energy            0.1035755370  1.00000000  0.032847557  0.668247944
## key               0.0014164396  0.03284756  1.000000000  0.020238291
## loudness          0.3847980060  0.66824794  0.020238291  1.000000000
## mode             -0.0424166570 -0.04371262 -0.174170158  0.007144594
## speechiness       0.2880560637  0.06065882 -0.003339108  0.067091927
## acousticness     -0.2537596673 -0.57807060 -0.017360855 -0.491999477
## instrumentalness -0.3345776576 -0.31475687 -0.026367389 -0.547322987
## liveness         -0.2502105046  0.25837921 -0.001745424 -0.018978820
## valence           0.5171426279  0.31726610  0.015964344  0.426772633
## tempo             0.0900580502  0.17122835 -0.003040262  0.202227504
## duration_ms       0.0004830046  0.15201424  0.006044278  0.195281479
## popularity        0.1597255536  0.05469420 -0.002388392  0.149949613
##                           mode  speechiness acousticness instrumentalness
## danceability     -0.0424166570  0.288056064  -0.25375967     -0.334577658
## energy           -0.0437126214  0.060658817  -0.57807060     -0.314756871
## key              -0.1741701578 -0.003339108  -0.01736086     -0.026367389
## loudness          0.0071445943  0.067091927  -0.49199948     -0.547322987
## mode              1.0000000000 -0.087636772   0.03888040     -0.063920945
## speechiness      -0.0876367717  1.000000000  -0.11592434     -0.273849185
## acousticness      0.0388803990 -0.115924341   1.00000000      0.291033539
## instrumentalness -0.0639209452 -0.273849185   0.29103354      1.000000000
## liveness         -0.0241449112  0.050249663  -0.02456814     -0.008284127
## valence           0.0002389504  0.115257854  -0.21538759     -0.335547352
## tempo             0.0171224145  0.038543375  -0.18312285     -0.119385544
## duration_ms       0.0351389868 -0.098355503  -0.11730165     -0.148671815
## popularity       -0.0454684641  0.050489909  -0.11698471     -0.075279942
##                      liveness       valence        tempo   duration_ms
## danceability     -0.250210505  0.5171426279  0.090058050  0.0004830046
## energy            0.258379213  0.3172660977  0.171228345  0.1520142437
## key              -0.001745424  0.0159643436 -0.003040262  0.0060442781
## loudness         -0.018978820  0.4267726333  0.202227504  0.1952814794
## mode             -0.024144911  0.0002389504  0.017122414  0.0351389868
## speechiness       0.050249663  0.1152578541  0.038543375 -0.0983555028
## acousticness     -0.024568144 -0.2153875874 -0.183122846 -0.1173016518
## instrumentalness -0.008284127 -0.3355473521 -0.119385544 -0.1486718149
## liveness          1.000000000 -0.1129996078 -0.029490757  0.0683864612
## valence          -0.112999608  1.0000000000  0.172984416  0.0460316403
## tempo            -0.029490757  0.1729844162  1.000000000  0.0509919444
## duration_ms       0.068386461  0.0460316403  0.050991944  1.0000000000
## popularity       -0.066955096  0.0358241022  0.061602311 -0.0250484441
##                    popularity
## danceability      0.159725554
## energy            0.054694203
## key              -0.002388392
## loudness          0.149949613
## mode             -0.045468464
## speechiness       0.050489909
## acousticness     -0.116984708
## instrumentalness -0.075279942
## liveness         -0.066955096
## valence           0.035824102
## tempo             0.061602311
## duration_ms      -0.025048444
## popularity        1.000000000
df$popularity <- as.factor(df$popularity)

We don't see many clearly related values, though the sheer number of attributes does make the matrix difficult to read. We'll hope that the dimensionality-reduction algorithms do well at cutting down the number of attributes in this data.
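As a rough numeric check (a sketch; note that popularity was just converted to a factor above, so we convert it back for the arithmetic), we can rank the predictors by their absolute correlation with popularity:

```r
# Rank the numeric predictors by absolute correlation with popularity (sketch).
# popularity was factored above, so convert it back to numeric first.
pop_num <- as.numeric(as.character(df$popularity))
cors <- sapply(df[, 1:12], function(col) cor(col, pop_num))
sort(abs(cors), decreasing = TRUE)
```

This makes the last column of the matrix above easier to scan, but none of the correlations are strong.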

Let's take a closer look at popularity, now that it's a factor.

summary(df$popularity)
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2694 2101 2146 1494  457  309  212  137  112   59   80   45   59  248  544  152 
##   16   17   18 
##   19    5    4

Hmm, a few too many levels. Let's combine some of these based on how many songs fall in each category.

#install.packages("forcats")
library(forcats)
popularityclass <- fct_collapse(df$popularity,
                                horrible = c('0','1'),
                                bad      = c('2','3','4','5'),
                                meh      = c('6','7','8','9','10','11','12'),
                                passable = c('13','14','15','16','17','18'))

df$popclass <- popularityclass

And now we’ll be sure it worked.

summary(df$popclass)
## horrible      bad      meh passable 
##     4795     4406      704      972
names(df)
##  [1] "danceability"     "energy"           "key"              "loudness"        
##  [5] "mode"             "speechiness"      "acousticness"     "instrumentalness"
##  [9] "liveness"         "valence"          "tempo"            "duration_ms"     
## [13] "explicit"         "popularity"       "track_name"       "track_artist"    
## [17] "track_id"         "popclass"

Cheers! Let's split the data into training and test sets now.

set.seed(2022)  # make the split reproducible
i <- sample(1:nrow(df), nrow(df) * 0.8, replace = FALSE)
train <- df[i,]
test <- df[-i,]
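As a quick sanity check (a sketch), the class proportions should come out similar in both splits, since a uniform random sample preserves them in expectation:

```r
# Compare class proportions between the training and test splits (sketch).
round(rbind(train = prop.table(table(train$popclass)),
            test  = prop.table(table(test$popclass))), 3)
```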

Visual Exploration

Now, let’s look at some charts to understand things a bit better.

pairs(df[c(3,4,6,8,9,11)])

plot(density(df$loudness),lwd=2)

plot(density(df$valence),lwd=2)

plot(density(df$tempo),lwd=2)

plot(density(df$speechiness),lwd=2)

The density plots confirm that key, liveness, and tempo are not very useful on their own. We can now better understand how the data is laid out, and we've confirmed that correlation is difficult to find. This is why we will use a kNN model to test dimensionality reduction on this data.

Dimensionality Algorithms

Okay, now let’s run PCA on the data. We have a lot of columns to consider. We’ll center and scale them while we’re at it.

set.seed(2022)
pca_out <- preProcess(train[,1:10], method=c("center","scale","pca"))
pca_out
## Created from 8701 samples and 10 variables
## 
## Pre-processing:
##   - centered (10)
##   - ignored (0)
##   - principal component signal extraction (10)
##   - scaled (10)
## 
## PCA needed 9 components to capture 95 percent of the variance

We weren't able to remove much: PCA still needed 9 components to capture 95% of the variance from our 10 variables.
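caret only reports the 95% threshold. To see the full variance profile, we could run prcomp directly on the same columns (a sketch that mirrors preProcess's centering and scaling):

```r
# Inspect per-component variance directly (sketch); centering and scaling
# match what preProcess did before extracting components.
pr <- prcomp(train[, 1:10], center = TRUE, scale. = TRUE)
summary(pr)$importance["Cumulative Proportion", ]
```

A slowly climbing cumulative proportion like this means the variance is spread fairly evenly across components, which is why so few could be dropped.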

Let's plot what we got. We'll put the nine components on three separate 3D charts.

train_pc <- predict(pca_out,train[,1:10])
test_pc <-  predict(pca_out, test[,1:10])

#install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x=test_pc$PC1, y=test_pc$PC2, z=test_pc$PC3, type="scatter3d", mode="markers",color=test$popclass)
plot_ly(x=test_pc$PC4, y=test_pc$PC5, z=test_pc$PC6, type="scatter3d", mode="markers",color=test$popclass)
plot_ly(x=test_pc$PC7, y=test_pc$PC8, z=test_pc$PC9, type="scatter3d", mode="markers",color=test$popclass)

Things are not looking promising. Since PCA wasn't able to reduce much, though, we can hope that using all of the components it created will help more, even if we can't visualize them together.

Let’s try kNN on it.

library(class)
train_df <- data.frame(train_pc, popclass = train$popclass)
test_df  <- data.frame(test_pc, popclass = test$popclass)
predknn <- knn(train=train_df[,1:9], test=test_df[,1:9], cl=train_df[,10], k=5)
mean(predknn==test$popclass)
## [1] 0.4632353
confusionMatrix(data=predknn, reference=test$popclass)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction horrible bad meh passable
##   horrible      518 377  71       94
##   bad           352 474  67       90
##   meh            17  18   4        8
##   passable       33  31  10       12
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4632          
##                  95% CI : (0.4421, 0.4845)
##     No Information Rate : 0.4228          
##     P-Value [Acc > NIR] : 7.712e-05       
##                                           
##                   Kappa : 0.1083          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: horrible Class: bad Class: meh Class: passable
## Sensitivity                   0.5630     0.5267   0.026316        0.058824
## Specificity                   0.5685     0.6011   0.978755        0.962475
## Pos Pred Value                0.4887     0.4822   0.085106        0.139535
## Neg Pred Value                0.6398     0.6429   0.930484        0.908134
## Prevalence                    0.4228     0.4136   0.069853        0.093750
## Detection Rate                0.2381     0.2178   0.001838        0.005515
## Detection Prevalence          0.4871     0.4517   0.021599        0.039522
## Balanced Accuracy             0.5658     0.5639   0.502535        0.510649
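The choice of k = 5 was arbitrary. A quick sweep over odd k values (a sketch, reusing the PC data frames built above) would show whether a different neighborhood size helps:

```r
library(class)
# Try several odd k values and record test accuracy for each (sketch).
ks <- seq(1, 25, by = 2)
accs <- sapply(ks, function(k) {
  pred <- knn(train = train_df[, 1:9], test = test_df[, 1:9],
              cl = train_df[, 10], k = k)
  mean(pred == test_df[, 10])
})
data.frame(k = ks, accuracy = round(accs, 4))
```

Odd values avoid ties between the two dominant classes; larger k generally smooths the decision boundary at the cost of the small classes.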

Well, this doesn't seem to have been too helpful. We have a less than 50% chance of getting a classification correct, even when we're looking at our larger trained classes. This may well be due simply to poor correlation in the data; we weren't even able to reduce it much. On another dataset, PCA may be more beneficial.

Linear Discriminant Analysis

Let's see if LDA works better for our dataset. However, we know well that our data is not linear, so hopes are low.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
ldapop <- MASS::lda(x = train[,1:12], grouping = train$popclass)
#ldapop <- lda(train$popclass~., data=train)
ldapop$means
##          danceability    energy      key   loudness      mode speechiness
## horrible    0.5428455 0.5489986 5.297032 -12.416375 0.6505806   0.1419429
## bad         0.5812764 0.5365977 5.202510 -10.970656 0.6454649   0.1243740
## meh         0.6052172 0.5958900 5.090580  -9.384701 0.6576087   0.1459534
## passable    0.6506609 0.5780222 5.335938  -9.330371 0.5338542   0.1695221
##          acousticness instrumentalness  liveness   valence    tempo duration_ms
## horrible    0.3886527        0.2487518 0.2341273 0.4454285 116.2633    201256.4
## bad         0.3446962        0.2397555 0.1969465 0.4803345 118.1281    211175.0
## meh         0.2827174        0.1320685 0.2033009 0.4851368 122.7796    221452.1
## passable    0.2687821        0.1741128 0.1916065 0.4711356 121.0209    185994.1

The class means look reasonable. When plotting later, we'll break the discriminant scores out into separate variables, since Plotly seemed to have some confusion over the column names. Note that PCA was strictly dimensionality reduction, but LDA also predicts, so we won't be using kNN this time.

lda_pred <- predict(ldapop,newdata=test[,1:12],type="class")
head(lda_pred$class)
## [1] horrible horrible horrible bad      bad      bad     
## Levels: horrible bad meh passable
#lda_train <- predict(ldapop,data=train,type="class")

We know the majority of our data is in the ‘bad’ or ‘horrible’ range, so all looks good here.
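Before plotting, it can also help to see which variables drive the discriminants. The scaling matrix on the fitted model (a sketch) gives the coefficients of each linear discriminant:

```r
# Coefficients of the linear discriminants (sketch): larger magnitudes
# indicate variables that contribute more to separating the classes.
round(ldapop$scaling, 3)
```

With 4 classes, lda produces at most 3 discriminants, which is why the 3D plot below can show everything at once.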

Now, let’s plot it!

library(plotly)
plot(lda_pred$x[,1], lda_pred$x[,3],
     pch=c(16,17,18,15)[unclass(test$popclass)],
     col=c("red","orange","yellow","green")[unclass(test$popclass)])

xaxis <- lda_pred$x[,1]
yaxis <- lda_pred$x[,2] 
zaxis <- lda_pred$x[,3]
target<- test$popclass
plot_ly(x=xaxis,y=yaxis,z=zaxis,type="scatter3d",mode="markers",color=target)

Things are not looking promising. The chart looks largely the same as any of our principal-component charts, even though this time we were able to chart all of the discriminants produced in a single view.

We now can check our confusion matrix and look into how well we actually managed to predict data.

library(class)
mean(lda_pred$class==test$popclass)
## [1] 0.4609375
confusionMatrix(data=lda_pred$class, reference=test$popclass)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction horrible bad meh passable
##   horrible      526 423  74       96
##   bad           394 477  78      108
##   meh             0   0   0        0
##   passable        0   0   0        0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4609          
##                  95% CI : (0.4398, 0.4822)
##     No Information Rate : 0.4228          
##     P-Value [Acc > NIR] : 0.0001794       
##                                           
##                   Kappa : 0.0733          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: horrible Class: bad Class: meh Class: passable
## Sensitivity                   0.5717     0.5300    0.00000         0.00000
## Specificity                   0.5279     0.5455    1.00000         1.00000
## Pos Pred Value                0.4701     0.4513        NaN             NaN
## Neg Pred Value                0.6272     0.6220    0.93015         0.90625
## Prevalence                    0.4228     0.4136    0.06985         0.09375
## Detection Rate                0.2417     0.2192    0.00000         0.00000
## Detection Prevalence          0.5142     0.4858    0.00000         0.00000
## Balanced Accuracy             0.5498     0.5377    0.50000         0.50000

The model entirely failed for the 'meh' and 'passable' songs, which is not surprising considering our model visualization. Its overall accuracy (46.1%) was actually marginally lower than PCA with kNN (46.3%). We are effectively no better than a coin flip between the two largest classes, despite there being 4 potential classes to choose from.
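For context, a trivial classifier that always predicts the largest class would score the No Information Rate from the confusion matrix by itself (a sketch):

```r
# Majority-class baseline (sketch): always predict 'horrible'.
# This equals the No Information Rate reported above (0.4228).
mean(test$popclass == "horrible")
```

Both models beat this baseline by only a few points, which puts their accuracy numbers in perspective.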

Conclusion and Analysis

We chose this dataset because its being advertised for clustering made it seem like a good fit for kNN as well, and because reduction promised to simplify the large number of attributes. After working with it, that expectation proved wrong; more goes into making a dataset good for kNN. Thinking about the nature of our data, bad songs on Spotify, we can also conclude that there isn't much of a trend in what makes a song "bad". Perhaps a genre could be recovered from this data via clustering, but popularity isn't an equation of things such as tempo, energy, or instrumentation. Sometimes a song is just bad for its content or other reasons.

When it came down to it, PCA+kNN and LDA effectively made a coin flip and then rated a song as 'bad' or 'horrible'. The PCA attempt was occasionally able to succeed on the smaller classes, while LDA never assigned them at all and stuck entirely to the two large classes; neither strategy gained a meaningful edge. Since the values were so scattered, increasing the amount of data likely would not have helped significantly. The reality is that there is little correlation here, and we have learned that neither PCA nor LDA is able to find or create correlation where there is none.